type-stable inner loop for sqrtm #20214
Conversation
As suggested by Ralph_Smith on [discourse](https://discourse.julialang.org/t/review-schur-pade-matrix-powers-speedup/1650/6). On my machine: ~15x speedup.
Do we have a benchmark for this in BaseBenchmarks?
I can't find a benchmark. What would that look like?
Take a look at https://github.com/JuliaCI/BaseBenchmarks.jl#contributing and https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/linalg/LinAlgBenchmarks.jl. The package will handle the timing; you should just provide the function.
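For illustration only, a minimal sketch of what such a benchmark entry could look like, using the BenchmarkTools API that BaseBenchmarks builds on; the group name, sizes, and key strings here are made up, and the real layout lives in LinAlgBenchmarks.jl:

```julia
using BenchmarkTools

# Hypothetical group layout -- BaseBenchmarks handles tuning and timing;
# the contributor only supplies the expressions to benchmark.
const SUITE = BenchmarkGroup()
g = addgroup!(SUITE, "sqrtm")
for n in (10, 100)
    A = UpperTriangular(rand(n, n) + n*I)   # well-conditioned upper-triangular input
    g["UpperTriangular, n=$n"] = @benchmarkable sqrtm($A)
end
```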
```
@@ -1888,10 +1898,7 @@ function sqrtm{T}(A::UpperTriangular{T})
for j = 1:n
```
It seems like we should just have `sqrtm(A)` call a `sqrtm{realmatrix}(A, ::Val{realmatrix})` function with either `Val{true}` or `Val{false}`, so that every operation on `R` is type-stable, and then you won't need `floop`.
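For illustration, a minimal sketch of the suggested `Val` dispatch pattern, in the Julia 0.5/0.6 syntax this thread uses; the `my_sqrtm` name and the real-result check are illustrative, not the PR's actual code:

```julia
# One dynamic dispatch at the boundary, then a fully type-stable kernel.
function my_sqrtm(A::UpperTriangular)
    # Illustrative check: the result stays real only for real input with a
    # nonnegative diagonal.
    realmatrix = eltype(A) <: Real && all(x -> x >= 0, diag(A))
    return my_sqrtm(A, Val{realmatrix})
end

function my_sqrtm{T,realmatrix}(A::UpperTriangular{T}, ::Type{Val{realmatrix}})
    # realmatrix is now a compile-time constant, so TT (and hence eltype(R))
    # is known to the compiler throughout the recurrence.
    TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))
    n = size(A, 1)
    R = zeros(TT, n, n)
    # ... the back-substitution recurrence filling R goes here ...
    return UpperTriangular(R)
end
```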
You can put `@inbounds` in front of this loop.
Done both 👍
You can just put the `@inbounds` in front of the `for ... end` block; no need for the extra `begin ... end` here. That is:

```julia
@inbounds for j = 1:n
    # loop body
end
```
```
@@ -1900,14 +1907,10 @@ end
function sqrtm{T}(A::UnitUpperTriangular{T})
```
It seems like we could replace this with `sqrtm{T}(A::UnitUpperTriangular{T}) = sqrtm(A, Val{true})` in terms of the abovementioned method. Given a `Val` method that is type-stable, the only remaining reason to duplicate the code here seems to be to save the `sqrt` call for the diagonal elements, but I doubt this is worth the trouble since there are only O(n) such calls.
Will look at this tomorrow
Since `UnitUpperTriangular{T}` is not a subtype of `UpperTriangular{T}` (??), I think it would be

```julia
sqrtm(A::UnitUpperTriangular) = UnitUpperTriangular(sqrtm(UpperTriangular(A), Val{true}))
```

For some reason this is faster than the present method, which confuses me. This version involves two additional conversions and some unnecessary arithmetic operations in the loop (of order O(n), as you mention)! Are some ops on `UnitUpperTriangular` slower than the same op on `UpperTriangular`?

Will make this change later unless someone can enlighten me.
(Julia doesn't support inheritance of concrete types.)
Indexing expressions `A[i,j]` are slower for `UnitUpperTriangular` than for `UpperTriangular`, and both are slower than for `Array`, because it has to check whether `i <= j` and (in the unit case) whether `i == j`.

Since by construction you only access the upper triangle, I would do `B = A.data` at the beginning of the function (for both the `UnitUpperTriangular` and `UpperTriangular` methods), and then use `B[i,j]` rather than `A[i,j]`.
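For illustration, a rough sketch of why the wrapper indexing costs something and of the suggested unwrapping; the `getindex` lines are paraphrased, not the exact Base definitions, and `upper_sum_sketch` is a made-up example function:

```julia
# Roughly what indexing the wrappers has to do (paraphrased):
#   A::UpperTriangular:     i <= j ? A.data[i,j] : zero(eltype(A))
#   A::UnitUpperTriangular: i < j  ? A.data[i,j] : (i == j ? one(eltype(A)) : zero(eltype(A)))

# Suggested pattern: unwrap once, then index the plain array inside the kernel.
function upper_sum_sketch(A::UpperTriangular)
    B = A.data                    # keep A for dispatch; B is the underlying Matrix
    s = zero(eltype(B))
    @inbounds for j = 1:size(B, 2), i = 1:j
        s += B[i, j]              # no per-element i <= j branch
    end
    return s
end
```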
(There might be other methods operating on triangular matrix types that might benefit from a similar change.)
The `@inbounds` means that it never checks whether `i` and `j` are in `1:n`. It still checks whether `i ≤ j`, since if `i > j` then it just returns zero rather than looking at `A.data`.

Timing such a small operation is tricky; I would normally recommend BenchmarkTools instead of `@time`:

```julia
julia> using BenchmarkTools

julia> A = rand(10,10); T = UpperTriangular(A);

julia> @btime $A[3,5]; @btime $T[3,5];
  1.377 ns (0 allocations: 0 bytes)
  1.650 ns (0 allocations: 0 bytes)
```
(It may be that the effect on the `UpperTriangular` case is too small to measure, however. Or maybe even LLVM is smart enough to figure out that `i ≤ j` is always true. But it wouldn't hurt to unwrap the type anyway.)
After improving the type stability, the specialised implementation for `UnitUpperTriangular` is faster than converting.

Now, on 0.5.0 using `@benchmark`, I see a slowdown when unwrapping the type first in `sqrtm`. Relatedly, on my machine (juliabox.com) there appears to be a slowdown for the atomic operations:

```
@benchmark $A[1,2]
  median time: 2.650 ns
@benchmark $T[1,2]
  median time: 2.223 ns
```

(What does the `$` do here?)

This persists when wrapping these calls in a function. All in all I am hesitant to make the unwrapping change for `sqrtm` (but not against).
How did you unwrap the type? I hope you didn't do `A = A.data`, which hurts type stability.

I can't reproduce your benchmark results. Maybe the benchmarks aren't reliable for such a short operation, and one needs to put a bunch of indexing operations in a loop? (Probably you also want to time them with `@inbounds`.) The `$` splices the value of the variable directly into the expression, which gets rid of the penalty for benchmarking with a global variable (it means the compiler does not have to do runtime type inference to find the type of `A`).
For example:

```julia
julia> function sumupper(A)
           s = zero(eltype(A))
           for j = 1:size(A,2), i = 1:j
               s += A[i,j]
           end
           return s
       end
sumupper (generic function with 1 method)

julia> A = rand(100,100); T = UpperTriangular(A);

julia> using BenchmarkTools

julia> sumupper(A) == sumupper(T)
true

julia> @btime sumupper($A); @btime sumupper($T);
  4.122 μs (0 allocations: 0 bytes)
  5.102 μs (0 allocations: 0 bytes)
```
As suggested by @stevengj

@stevengj great idea, implemented 👍 In my testing, I found that

@andreasnoack I'm preparing a PR for BaseBenchmarks 👍
```
@@ -1889,14 +1888,18 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A::UpperTriangular,Val{realmatrix})
```
The `::UpperTriangular` type assertion isn't usually used on call sites.
```
@@ -1879,19 +1888,20 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A,Val{realmatrix})
end
function sqrtm{T,realmatrix}(A::UpperTriangular{T},::Type{Val{realmatrix}})
if realmatrix
TT = typeof(sqrt(zero(T)))
else
TT = typeof(sqrt(complex(-one(T))))
```
Just use `sqrt(complex(zero(T)))` here, since at some point `one` may return a dimensionless value if `T` is dimensionful.
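A hedged illustration of the dimensionful point, using Unitful.jl (an external package, not part of this PR) only as an example of a unitful element type; the comments describe the expected results:

```julia
using Unitful  # illustrative external package, not used by Base

x = 4.0u"m"      # a dimensionful value of some type T
zero(x)          # 0.0 m  -- keeps the units of T
one(x)           # 1.0    -- dimensionless multiplicative identity
sqrt(zero(x))    # 0.0 m^(1/2)  -- the units sqrtm's elements should have
sqrt(one(x))     # 1.0          -- dimensionless, i.e. the wrong element type
```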
```
@@ -1867,7 +1867,16 @@ function logm{T<:Union{Float64,Complex{Float64}}}(A0::UpperTriangular{T})
end
logm(A::LowerTriangular) = logm(A.').'

function sqrtm{T}(A::UpperTriangular{T})
function floop(x,R,i::Int,j::Int)
r = x
```
This is not type stable if `A[i,j]` and `R[i,j]` are not the same type. A simple fix is `r = x + zero(eltype(R))`.
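A minimal sketch of the instability and of the suggested fix; the function names are illustrative:

```julia
# If x::Float64 but R::Matrix{Complex{Float64}}, r starts as a Float64 and
# becomes Complex{Float64} on the first loop iteration -- type-unstable.
function floop_unstable(x, R, i::Int, j::Int)
    r = x
    for k = i+1:j-1
        r -= R[i,k]*R[k,j]
    end
    return r
end

# Promoting r up front pins its type for the whole loop.
function floop_stable(x, R, i::Int, j::Int)
    r = x + zero(eltype(R))
    for k = i+1:j-1
        r -= R[i,k]*R[k,j]
    end
    return r
end
```

The difference shows up with `@code_warntype`, e.g. `@code_warntype floop_unstable(1.0, complex(rand(4,4)), 1, 4)`.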
@felixrehren, have you tried
```
for j = 1:n
R[j,j] = realmatrix ? sqrt(A[j,j]) : sqrt(complex(A[j,j]))
for i = j-1:-1:1
r = A[i,j] + zero(TT)
```
Even simpler: `r::TT = A[i,j]`.

Though (like a lot of our generic linear-algebra code) this isn't right for dimensionful quantities, because `r` has the units of `A` whereas `R` has the units of `sqrt(A)`. If we want to get that right, it should really be something like `TA = realmatrix ? float(T) : Complex{float(T)}`, and then use `r::TA`.
To be clear, do you suggest replacing `TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))` with `TT = realmatrix ? float(T) : Complex{float(T)}`, or would `TA` be supplemental to `TT`? Because `r` should have the units of `R = sqrtm(A)`?
`TA` would be supplemental to `TT`, because `A` and `R` have different units.
Instead of introducing a new type variable, you can simply do `r = A[i,j] - zero(TT)*zero(TT)`, which matches the operation in the loop and is therefore correct. We should avoid explicit `float` calls if possible, since it requires that `float` is implemented correctly for new types (which might not be the case).
I hope this is addressed -- I guess we should write some tests? (for a different PR ...)
```
end
n = checksquare(A)
n = Base.LinAlg.checksquare(A)
R = zeros(TT, n, n)
```
~~My inclination would be to use `R = Matrix{TT}(n, n)` here. Then get rid of the `r == 0` check below. `sqrtm` is unlikely to return a sparse result, so I don't see the point of pre-initializing the array to zero.~~
Oh, nevermind, I forgot that this is for the upper-triangular case. Then I guess initializing it to zero makes sense.
@stevengj Thanks for the comments, I am also learning a lot! I think the AppVeyor failure is unrelated?
The "Partial linear indexing" thing was just fixed by #20242 |
Replace `TT` by `t` for the type of the sqrt of a variable of type `T`; introduce `tt` as the type of the square of a variable of type `t`. N.B. `tt` is not always the same as `T`: it could be `Complex{T}`. In the `UnitUpperTriangular` case, some of the complexity should fall away; TODO?
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r/one)
```
`one` won't be equivalent to the former `(R[i,i] + R[j,j])`, will it?
You're right, thanks!
Why is it equivalent to `2*R[1,1]`?
Because this is for the `UnitUpperTriangular` case only, `R[i,i] = R[1,1]`.
Ah right. Missed the `UnitUpperTriangular` since that line was on the other side of a long conversation.
I know this is not a blocker for 0.6, but it would be nice to get it in as we plan for an alpha release. Marking it 0.6 mainly as a reminder, but feel free to take off the 0.6 tag if necessary.
Please don't put "nice to have" on the milestone.
Think this is it -- @stevengj, @andreasnoack?
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = half*r)
```
The old version of this function might not have worked correctly in this case either, but would this do the right thing if the element type were not commutative under multiplication?
See my comment above for the general case. However, even for non-commutative algebras the case here is special, because `R[i,i]` is the identity, which always commutes with everything. So the problem reduces to solving `R[i,j] + R[i,j] = r`. If the elements form a field over the reals, then the solution `R[i,j] = r/2` is correct.

The case here is also wrong if e.g. the elements are matrices, in which case `1/(2*R[1,1])` will throw an error, although if you do `half = inv(2*R[1,1])` it will work. However, it would be even better to just do `R[i,j] = r/2` in that case, since it would avoid the matrix multiplication.

However, it is still wrong if the elements are some other number field where you can't do `/2`. The correct, general solution seems like it should be

```julia
half = inv(R[1,1] + R[1,1])  # don't use 1/(2R[1,1]) or half=0.5, to handle general algebraic types
```
(Note that this PR can go in after the feature freeze since it is just a performance optimization.)
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r / (R[i,i] + R[j,j]))
```
Per @tkelman's comment below, for non-commutative number types (e.g. quaternions), this is not right.

I haven't gone through the algebra very carefully, but I think the final `R[i,j]` needs to solve `R[i,i]*R[i,j] + R[i,j]*R[j,j] == r`, i.e. a Sylvester equation. So the right thing to do seems like

```julia
R[i,j] = sylvester(R[i,i], R[j,j], -r)
```

and then add a method

```julia
sylvester(a::Union{Real,Complex}, b::Union{Real,Complex}, c::Union{Real,Complex}) =
    (-c) / (a + b)
```

for the commutative `Number` cases. New `Number` types will then have to provide their own `sylvester` method if they want `sqrtm` to work. (We already have `sylvester` methods for matrices thanks to #7435.)
```
sqrtm(A,Val{realmatrix})
end
# solve the sylvester equation a*x + x*b + c for x when a,b,x are commutative numbers. PR#20214
sylvester(a::Union{Real,Complex},b::Union{Real,Complex},c::Union{Real,Complex}) = -c / (a + b)
```
Probably this `sylvester` definition should go into complex.jl or similar? And then you'll need to make sure that the LinAlg module imports `Base.sylvester` so that it extends the base definition.
I would parenthesize `(-c)`. That way, if you call `sylvester(a, b, -r)`, the compiler will have a good shot at noticing that the two negations cancel.
Never mind, it looks like LLVM is smart enough to eliminate the double negation either way:

```julia
julia> syl(a,b,c) = -c / (a + b)
syl (generic function with 1 method)

julia> foo(a,b,c) = syl(a,b,-c)
foo (generic function with 1 method)

julia> @code_llvm foo(1.0,2.0,3.0)

define double @julia_foo_65538(double, double, double) #0 !dbg !5 {
top:
  %3 = fadd double %0, %1
  %4 = fdiv double %2, %3
  ret double %4
}
```
LGTM except that the definition of

Also, you don't need to put
@stevengj Ok, will do. Should I keep any of the comments or clean it all up?
You should have a comment on code that does something in a non-obvious way, to make sure no one changes it without realizing, but your comments don't have to mention the PR.
moved from triangular.jl
Think all comments so far are addressed; let me know if it could use further improvements.
LGTM. Is the overall speedup still 15x?
With all the nice tweaks, I see 20x in this benchmarking notebook on Julia 0.5.